01 First Model

A little data exploration and a VERY simple OLS model

In [1]:
import pandas as pd
import pandas_profiling as pp
In [2]:
df = pd.read_csv("../data/raw/train.csv")
In [3]:
df.head()
Out[3]:
   Id  MSSubClass  MSZoning  LotFrontage  LotArea  Street  Alley  LotShape  LandContour  Utilities  ...  PoolArea  PoolQC  Fence  MiscFeature  MiscVal  MoSold  YrSold  SaleType  SaleCondition  SalePrice
0   1          60        RL         65.0     8450    Pave    NaN       Reg          Lvl     AllPub  ...         0     NaN    NaN          NaN        0       2    2008        WD         Normal     208500
1   2          20        RL         80.0     9600    Pave    NaN       Reg          Lvl     AllPub  ...         0     NaN    NaN          NaN        0       5    2007        WD         Normal     181500
2   3          60        RL         68.0    11250    Pave    NaN       IR1          Lvl     AllPub  ...         0     NaN    NaN          NaN        0       9    2008        WD         Normal     223500
3   4          70        RL         60.0     9550    Pave    NaN       IR1          Lvl     AllPub  ...         0     NaN    NaN          NaN        0       2    2006        WD        Abnorml     140000
4   5          60        RL         84.0    14260    Pave    NaN       IR1          Lvl     AllPub  ...         0     NaN    NaN          NaN        0      12    2008        WD         Normal     250000

5 rows × 81 columns

In [4]:
pr = pp.ProfileReport(df)
In [5]:
pr
Out[5]:

In [6]:
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
In [7]:
import matplotlib.pyplot as plt
%matplotlib inline  

First model

Very first model: let's try something really simple:

  • Pick the 9 features most highly correlated with the target (the [1:10] slice used below skips SalePrice itself and keeps 9 names, not 10).
  • Fill any missing values with the mode.
  • Fit an OLS regression.
In [23]:
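# Restrict to numeric columns so the call also works on pandas versions that no longer drop non-numeric columns silently.
df.select_dtypes("number").corrwith(df['SalePrice']).sort_values(ascending=False).index[1:10]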
Out[23]:
Index(['OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea', 'TotalBsmtSF',
       '1stFlrSF', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt'],
      dtype='object')
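
The selection step above can be wrapped in a small helper. This is only a sketch under assumptions: the function name top_correlated, the toy DataFrame and its values are made up for illustration and are not part of the notebook. It ranks by absolute correlation, so strongly negative features are not overlooked, and it drops the target explicitly instead of relying on its position in the sorted index.

```python
import numpy as np
import pandas as pd


def top_correlated(frame, target, k=9, method="pearson"):
    """Return the k numeric columns most correlated (in absolute value) with target.

    Hypothetical helper, not part of the original notebook.
    """
    numeric = frame.select_dtypes(include="number")
    # Drop the target explicitly rather than slicing it off the sorted index.
    corr = numeric.drop(columns=[target]).corrwith(numeric[target], method=method)
    return corr.abs().sort_values(ascending=False).head(k).index.tolist()


# Tiny synthetic frame standing in for the housing data (made-up values).
rng = np.random.default_rng(0)
size = rng.normal(size=200)
toy = pd.DataFrame({
    "size": size,
    "noise": rng.normal(size=200),
    "zone": ["RL", "RM"] * 100,  # non-numeric column, ignored by the helper
    "price": 3.0 * size + rng.normal(scale=0.1, size=200),
})

top = top_correlated(toy, "price", k=1)
print(top)
```

On the real data the call would look like top_correlated(df, "SalePrice", k=9); whether Pearson or Spearman is the better ranking here is a judgment call.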
In [24]:
feats = pr.description_set['correlations']['spearman']["SalePrice"].sort_values(ascending=False).index[1:10]
target = "SalePrice"
In [25]:
df['GarageYrBlt'] = df['GarageYrBlt'].fillna(df['GarageYrBlt'].mode()[0])
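
The cell above fills a single column by hand. If more of the selected features had gaps, the same idea could be applied per column. The sketch below assumes a hypothetical helper fill_with_mode and uses a toy frame with invented values; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd


def fill_with_mode(frame, columns):
    """Return a copy of frame where NaNs in the given columns are replaced by each column's mode.

    Hypothetical helper, not part of the original notebook.
    """
    out = frame.copy()
    for col in columns:
        modes = out[col].mode(dropna=True)
        if not modes.empty:  # an all-NaN column has no mode, so it is left untouched
            out[col] = out[col].fillna(modes.iloc[0])
    return out


# Made-up values; only the column names come from the dataset.
toy = pd.DataFrame({
    "GarageYrBlt": [2005.0, np.nan, 2005.0, 1998.0],
    "GarageCars": [2, 2, 1, 2],
})

filled = fill_with_mode(toy, ["GarageYrBlt", "GarageCars"])
print(filled)
```

Working on a copy keeps the raw DataFrame intact, which makes it easier to try a different imputation strategy later.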
In [26]:
pp.ProfileReport(df[feats])
Out[26]:

In [27]:
X_train, X_test, y_train, y_test = train_test_split(df[feats], df[target], test_size=0.3)
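
Without a fixed seed, the split above changes on every run, and so does the reported error. A minimal sketch of a reproducible split, using toy arrays and an arbitrary seed of 42 (both are assumptions, not values from the notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 2 features.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# The same random_state gives the same partition on every call.
split_a = train_test_split(X, y, test_size=0.3, random_state=42)
split_b = train_test_split(X, y, test_size=0.3, random_state=42)

same_test_rows = np.array_equal(split_a[1], split_b[1])
n_train, n_test = len(split_a[0]), len(split_a[1])
print(same_test_rows, n_train, n_test)
```

Passing random_state to the notebook's own train_test_split call would make the RMSE comparable between runs and between later models.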
In [28]:
rgr = linear_model.LinearRegression().fit(X_train, y_train)
In [29]:
mean_squared_error(y_test, rgr.predict(X_test))**0.5
Out[29]:
46099.7877852043
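
An RMSE is easier to judge next to a trivial baseline, for example always predicting the training mean. The sketch below uses synthetic data with invented coefficients, so its numbers say nothing about the housing data; it only illustrates the comparison. The rmse helper is hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem (coefficients and noise level are made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 180_000 + 40_000 * X[:, 0] + 15_000 * X[:, 1] + rng.normal(scale=10_000, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)


def rmse(model):
    """Fit on the training split and return the root mean squared error on the test split."""
    model.fit(X_train, y_train)
    return float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))


baseline_rmse = rmse(DummyRegressor(strategy="mean"))  # always predicts the training mean
ols_rmse = rmse(LinearRegression())
print(f"baseline: {baseline_rmse:.0f}, OLS: {ols_rmse:.0f}")
```

Applied to the notebook's X_train and X_test, the same comparison would show how much of the roughly 46,100 RMSE the nine features actually explain.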